Abstract

We consider the problem of identifying a team of skilled individuals for collaboration, in the presence 
of a social network. Each node in the input social network may be an expert in one or more skills - such 
as theory, databases or data mining. The edge weights specify the affinity or collaborative compatibility 
between respective nodes. Given a project that requires a specified number of skilled individuals in
each area of expertise, the goal is to identify a team that maximizes the collaborative compatibility. For
example, the requirement may be to form a team that has at least three databases experts and at least 
two theory experts. 

We explore team formation where the collaborative compatibility objective is measured as the density 
of the induced subgraph on selected nodes. The problem of maximizing density is NP-hard even when 
the team requires a certain number of individuals of only one specific skill. We present a 3-approximation 
algorithm that improves upon a naive extension of the previously known algorithm for the densest at least
k subgraph problem. We further show how the same approximation can be extended to a special case 
of multiple skills as well. Our problem generalizes the formulation studied by Lappas et al. [KDD '09]. 
Further, they measured collaborative compatibility in terms of diameter and the spanning tree costs. Our 
density based objective also turns out to be more robust in certain aspects. 

Experiments are performed on a crawl of the DBLP graph where individuals can be skilled in at most 
four areas - theory, databases, data mining, and artificial intelligence. In addition to our main algorithm, 
we also present heuristic extensions to trade off between the size of the solution and its induced density. 
These density-based algorithms outperform the diameter-based objective on several metrics for assessing 
the collaborative compatibility of teams. The solutions suggested are also intuitively meaningful and scale 
well with the increase in the number of skilled individuals required. 



* Yahoo! Inc, Santa Clara, CA, USA. E-mail: amitag@yahoo-inc.com.

† Google Research, Google Inc., Mountain View, CA, USA. E-mail: dassarma@google.com. Part of the work done while at
Georgia Institute of Technology, GA.



1 Introduction 



A team formation problem consists of forming a team from a large set of candidates such that the resulting 
team is best suited to perform the assignment. The main difficulty in providing an automated way to form 
a team from the solution space is the categorization of the desired attributes quantitatively. In spite of this, 
the problem has attracted many researchers and various interesting approaches have been suggested over the 
years, as discussed in the related work section. In this spirit, we study this problem in the context of a
social network, with the goal of identifying the most collaborative team that satisfies the skill-set requirements of
the project. Certainly, the naive approach would be to just find the candidates that best match the requirements.
However, considering the social network associated with the candidates adds value to the solution
because, intuitively, such a team is more likely to demonstrate better collaborative compatibility. This is
also evident in practice, where many companies promote employee referral programs when hiring
candidates.

We model this team formation problem in the social network context by considering the network graph 
that connects the individuals, wherein each individual is represented by a node in the graph and an association 
between individuals is represented by an edge in the graph. In a more generic sense, each node can be assigned
a set of desired attributes and an edge can be assigned a weight representing the collaborativeness between the
individuals it connects. Note that this model could be extended further in multiple dimensions, and we
believe that the work we present in this paper could be a good starting point in this regard. For example,
one possible extension to this graph model would be a hypergraph model that can accommodate many
criteria: the weight associated with a hyperedge could define the collaborative compatibility between a set of
nodes (instead of just two nodes), a hyperedge could also be used to denote the set of nodes that represent a
certain group, etc.

In this paper, as a starting point, we define the problem where each node is associated with a set of skills
and the weight of an edge reflects the cohesiveness between the two nodes (users) it connects; the goal is to form
a collaborative team for a project that requires a specified number of people in each of a set of skills. In
this setting, two users can collaborate better as a team if they have a high-weight edge (strong affinity for 
interaction) between them. Specifically, consider the following example where a social network of computer 
scientists is presented. Each user is skilled in a subset of areas between theory, databases and data mining. 
A company wants to hire people for a predetermined project. The goal of the project requires that the team 
consists of at least three database researchers, at least two theory researchers, and at least one researcher with 
expertise in data mining. Presented with the social network where edges reflect collaborative interactions,
how should the company go about hiring a team for the project? 

A special case of this problem was studied in [12]. They consider team formation when the team requires
at most one person each in a set of different skills. Our problem formulation generalizes this by allowing the 
team to require multiple skilled individuals in any skill. Clearly there are projects where multiple people with 
specific skills may be desired. It turns out that this generalization makes the problem significantly harder 
and more interesting. For example, the problem is no longer trivial even when the social network contains 
users that are either skilled or not skilled in just one specific area. Suppose a project requires eight database 
researchers, and the social network contains people who are either skilled in databases or not, how does one 
go about choosing the team? We discuss the complexity of, and algorithmic results for, this special
case shortly.

A critical question in team formation based on a social network is to determine the collaborative quality 
of a team. The edges specify the collaborative compatibility of two nodes. However, given a subset of say k 
nodes in the social network (let us even say these k nodes are connected), how do we know how collaborative 
this team is? To tackle this, [12] suggested two objectives: one based on the diameter of the subgraph
induced by these k nodes, and another based on the spanning tree cost of these nodes; and demonstrated the
potential of these ideas through experimental results. These objectives can certainly be applied to solve the
problem we define in this paper. In fact, we provide an extension of their diameter-based algorithm, prove
a 2-approximation bound for it, and complement it with experimental results. Similarly, the minimum-spanning
tree based approach could also be extended to the problem defined here. However, the main focus of our 
paper is a novel density based objective that we propose for this problem; therefore, the majority of this 
paper's contributions are related to this density objective. Specifically, we define the collaborative affinity of 
a team of k nodes to be proportional to the density of the induced subgraph. Using density as a measure 
of the quality of an induced subgraph of nodes has certain intuitive merits over using diameter or minimum 
spanning tree costs; we describe these in Section 3.

We briefly summarize the problem definition here: given a set of skills $1, 2, \ldots, t$, and requirements
$k_1, k_2, \ldots, k_t$, and a social network of nodes connected by (weighted) edges, the goal is to pick a subset
of nodes such that at least $k_i$ distinct nodes possess skill $i$, for $1 \le i \le t$. The same node, however, may
contribute to two different skills. The objective value of the solution is the density of the induced subgraph
on these nodes. The goal is to maximize this objective. Notice that the number of returned nodes may be as
small as $k_{\max} = \max_i k_i$ or be even larger than $\sum_i k_i$. We now summarize the contributions of this paper.
Our Contributions. 

• We present a novel problem definition for team formation to maximize collaborative compatibility. 
The constraint of the problem requires the team to comprise at least a specified number of skilled
individuals in each of a set of skills. This generalizes previous work that required forming a team with 
at least one skilled individual in each of a set of skills. 

• As a measure of collaborative compatibility, we suggest a density based objective. Density is a novel 
metric for this domain and we show that it has certain desirable properties for measuring compatibility. 
Our density based team formation problem also generalizes previous graph algorithms work on finding 
densest subgraphs with size constraints. 

• We address the collaborative team formation problem when the team requires one or more skills. We 
show that optimizing even the special case of a single skill is NP-hard under our density-based metric, 
as well as the previously suggested diameter-based metric. The main theoretical result of the paper is 
to present a novel 3-approximation algorithm for the density based team formation problem for both 
single as well as a special case of multiple skills. This improves upon a naive extension of previous work 
on size constrained densest subgraph problems. We also show how previous work on a 2-approximation 
for the diameter-based objective can be extended to our generalized problem. 

• We present several heuristic algorithms that build on our 3-approximation for density-based team 
formation. These algorithms trade off between the size of the returned solution and the density, while
respecting the constraints on the skill requirements. 

• We perform experiments on all these algorithms on the DBLP graph. Experiments show that density- 
based algorithms perform well in practice, identifying tightly knit and highly skilled teams and also 
scale well with the size of the team and skill requirements. 

• We present qualitative evidence on the teams reported by both density-based and diameter-based algorithms
and show that the density-based algorithms compare favorably to the diameter-based algorithms
on a number of different metrics. Further analysis of the reported teams (by inspecting their members)
shows that the density-based approach suggests teams that are more intuitive and meaningful
than diameter-based teams.

Overview. We mention related work in Section 2. The various problem definitions, notations and some
properties are formalized in Section 3. Our theoretical contributions, including the main 3-approximation
algorithm for our density based objective, are described in Section 4. The theoretical work on a diameter
based objective is presented in Section 5. Finally, some additional heuristic algorithms and experimental
results are detailed in Section 6.

2 Related Work 

Various interesting approaches for team formation have been studied over the years. In operations
research [3, 4, 14, 15], the problem is defined as finding an optimal match between people and demanded
functional requirements. It is often solved using techniques such as simulated annealing, branch-and-cut
or genetic algorithms [5, 14, 15]. Another interesting problem formulation requires taking into consideration
the psychological aspects of the individuals involved in order to form a team that collaborates efficiently, e.g., the
work by Fitzpatrick and Askin [5], and Chen and Lin [3]. Although all these approaches are interesting,
they do not use the possible presence of a social graph structure between the individuals. Therefore, these
approaches are complementary to ours. Further, Gaston et al. provide an experimental study on the effects
of a graph structure among individuals on the performance of a team.

Our problem formulation differs from these fundamentally by requiring a solution where the optimality 
is determined based on the properties associated with a social graph structure among the individuals. In 
particular, we aim to form a team that contains at least $k_i$ nodes of skill $i$ such that the density of the
resulting solution subgraph is maximized. A similar problem has been addressed by Lappas et al. [12]. They
try to find a team that contains at least one node for each skill $i$, with the cost of a solution measured in terms
of either a diameter or a minimum spanning tree. Our problem definition generalizes this requirement and
suggests a new density-based measure for the solution's objective.

The problem of finding size-bound densest subgraphs is well-studied. Finding a maximum density subgraph
on an undirected graph can be solved in polynomial time [5, 13]. However, the problem becomes
NP-hard when a size restriction is enforced. In particular, finding a maximum density subgraph of size exactly
$k$ is NP-hard [2, 5] and no approximation scheme exists under a reasonable complexity assumption [9].
Khuller and Saha [10] considered the problem of finding densest subgraphs with size restrictions and showed
that these are NP-hard. Khuller and Saha [10] and also Andersen and Chellapilla [1] gave constant factor
approximation algorithms. Our problem definition varies from these because we not only require finding the
maximum density subgraph of size at least $k$, but also require that this subgraph contain $k_i$ nodes of
property (or skill) $i$ such that $k = k_1 + k_2 + \cdots + k_m$. Thus, we also generalize past work on finding size-bound
maximum density subgraphs.

3 Problem Definition 

Notation. Let $X = \{1, \ldots, n\}$ denote a set of $n$ individuals and $A = \{a_1, \ldots, a_m\}$ denote a set of $m$ skills.
Each individual $i$ is associated with a set of skills $X_i \subseteq A$. If $a_j \in X_i$, then individual $i$ has skill $a_j$. For
each skill $a$, we define its support set, $S(a)$, as the set of individuals in $X$ with skill $a$. That is,
$S(a) = \{i \mid i \in X \text{ and } a \in X_i\}$. A task $T$ is a set of pairs where each pair, $\langle a_j, k_j \rangle$, specifies that at least $k_j$ individuals of skill
$a_j$ are required to perform the task.

Let $G(X, E)$ denote the undirected, weighted graph representing the social network associated with the
set of individuals $X$. We use the notations $E(G)$ and $V(G)$ to represent the edge set and vertex set associated
with the graph $G$. If $X' \subseteq V(G)$, we use $G[X']$ to denote the subgraph of $G$ induced by the nodes in $X'$.
Further, $W(X')$ denotes the sum of the edge weights associated with all the edges in the subgraph induced
by the nodes in $X'$. We also define a distance function between any two nodes $i, i'$ in a graph $G$ as the sum of
the edge weights along the shortest path between $i$ and $i'$ in $G$. Further, without loss of generality, we assume
that the graph $G$ is connected; we can transform every disconnected graph to a connected one by simply
adding an edge that denotes zero collaborative compatibility. Given a measure of collaborative compatibility
$Cc(\cdot)$, we now formalize the problems considered in this paper.

Single Skill Team Formation (sTF). Given a set of $n$ individuals $X = \{1, \ldots, n\}$, a graph $G(X, E)$, and a task
$T = \{\langle a, k \rangle\}$, find $X' \subseteq X$ such that $|X' \cap S(a)| \ge k$ and the collaborative compatibility $Cc(X')$ is
optimized.

Multiple Skill Team Formation (mTF). Given a set of $n$ individuals $X = \{1, \ldots, n\}$, a graph $G(X, E)$, and a
task $T = \{\langle a_1, k_1 \rangle, \langle a_2, k_2 \rangle, \ldots, \langle a_m, k_m \rangle\}$, find $X' \subseteq X$ such that $|X' \cap S(a_j)| \ge k_j$ for each
$j \in \{1, \ldots, m\}$ and the collaborative compatibility $Cc(X')$ is optimized.
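
The feasibility constraint shared by sTF and mTF can be illustrated with a minimal Python sketch. The names `support` and `is_feasible`, the dictionary-based encoding of skills and tasks, and the sample individuals are assumptions made for illustration; they do not come from the paper.

```python
def support(skills, skill):
    """S(a): the set of individuals whose skill set contains `skill`."""
    return {person for person, owned in skills.items() if skill in owned}


def is_feasible(team, skills, task):
    """True if the team has at least k_j members in S(a_j) for every pair in the task."""
    team = set(team)
    return all(len(team & support(skills, a)) >= k for a, k in task.items())


# Hypothetical instance: a task is encoded as {skill: required count}.
skills = {
    "ann": {"databases"},
    "bob": {"databases", "theory"},
    "cat": {"theory"},
    "dan": {"data mining"},
}
task = {"databases": 2, "theory": 2}

print(is_feasible({"ann", "bob", "cat"}, skills, task))  # bob counts toward both skills
print(is_feasible({"ann", "cat", "dan"}, skills, task))  # only one database expert
```

A single-skill task is the special case of a dictionary with one entry.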

The main metric that we consider for collaborative compatibility for sTF and mTF is the following density
based objective. In addition to this, we consider a diameter based objective as well (suggested in [12]) for 
comparison. 

Maximum Density (D). Given a graph $G(X, E)$ and a set of individuals $X' \subseteq X$, we define the density
collaborative compatibility of $X'$, denoted by Cc-D$(X')$, to be the density of the induced subgraph $G[X']$.
Recall that the density $d(G)$ of a graph $G$ is defined as $d(G) = \frac{W(G)}{|V(G)|}$, where $W(G)$ is the total edge weight of $G$. The higher the value of the density,
the better the collaborative compatibility. An optimal solution $X' \subseteq X$ is the team that can perform task
$T$ and has maximum density.
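
As a small illustration of the density objective, the following Python sketch computes the total induced edge weight divided by the number of team members. The function name `density` and the encoding of edges as a dictionary from node pairs to weights are assumptions for this example.

```python
def density(team, edges):
    """Density of the subgraph induced by `team`: induced edge weight / number of nodes."""
    team = set(team)
    if not team:
        raise ValueError("density is undefined for an empty team")
    weight = sum(w for (u, v), w in edges.items() if u in team and v in team)
    return weight / len(team)


# Hypothetical weighted social network on four individuals.
edges = {("a", "b"): 2.0, ("b", "c"): 1.0, ("c", "d"): 4.0}

print(density({"a", "b", "c"}, edges))  # (2.0 + 1.0) / 3
print(density({"c", "d"}, edges))       # 4.0 / 2
```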

Minimum Diameter (R). Given a graph $G(X, E)$ and a set of individuals $X' \subseteq X$, we define the diameter
collaborative compatibility of $X'$, denoted by Cc-R$(X')$, to be the diameter of the subgraph $G[X']$. Recall
that the diameter of a graph is the largest shortest path between any two nodes in the graph. An optimal
solution $X' \subseteq X$ is the team that can perform task $T$ and has minimum diameter.
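
For comparison, the diameter objective can be sketched in Python with an all-pairs shortest path computation (Floyd-Warshall) restricted to the induced subgraph. The function name `diameter` and the edge encoding are assumptions; edge weights are used as path lengths, following the distance function defined in the notation, and the team is assumed to be non-empty.

```python
from itertools import product
from math import inf


def diameter(team, edges):
    """Largest shortest-path distance inside the subgraph induced by a non-empty team."""
    nodes = set(team)
    dist = {(u, v): (0.0 if u == v else inf) for u, v in product(nodes, repeat=2)}
    for (u, v), w in edges.items():
        if u in nodes and v in nodes:
            dist[u, v] = dist[v, u] = min(dist[u, v], w)
    # Floyd-Warshall: `mid` is the outermost loop variable of the product.
    for mid, u, v in product(nodes, repeat=3):
        dist[u, v] = min(dist[u, v], dist[u, mid] + dist[mid, v])
    return max(dist.values())


edges = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 5.0}

print(diameter({"a", "b", "c"}, edges))  # a reaches c through b
print(diameter({"a", "c"}, edges))       # only the direct edge remains
print(diameter({"a", "b", "d"}, edges))  # d is isolated, so the diameter is infinite
```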
In the following sections, we refer to the Single Skill Team Formation (sTF) and Multiple Skill Team
Formation (mTF) problems with collaborative compatibility Cc-R as Diameter-sTF and Diameter-mTF, 
respectively. Similarly, for the collaborative compatibility Cc-D we refer to the corresponding problems as 
Density-sTF and Density-mTF respectively. 

Properties. We now describe some properties of the maximum density objective. Notice that neither of
these properties holds for Diameter-sTF or Diameter-mTF. For brevity, we mention the intuition without a
rigorous definition or proof. 

Strict Monotonicity. If a communication edge (with positive weight) is added between two nodes in the 
solution set for the Density-sTF or Density-mTF problem, then the collaborative compatibility objective 
Cc-D for the solution necessarily increases. Similarly, if a communication edge already present is deleted, 
then the Cc-D objective value decreases. This seems intuitive as an added collaboration between two people 
in the team enhances the quality of the team. However, in the case of diameter, adding or deleting an edge 
may not affect the solution at all. 

Sensitivity. The Cc-D value for Density-sTF or Density-mTF does not increase or decrease radically upon 
adding or deleting an edge. Specifically, it can only change to an extent depending on the weight of the added 
or deleted edge, compared to the total weight of edges in the solution. However, adding or deleting an edge 
can radically change the diameter (for example make it finite from infinite) for an induced subgraph; this 
implies that the diameter objective is highly sensitive to small changes.
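
A tiny numeric illustration of these two properties, written as a Python sketch: adding one edge of weight $w$ inside a team of size $|X'|$ raises the density by exactly $w / |X'|$. The helper `density` and the four-node example are assumptions for illustration.

```python
def density(team, edges):
    """Induced edge weight divided by the number of team members."""
    weight = sum(w for (u, v), w in edges.items() if u in team and v in team)
    return weight / len(team)


team = {"a", "b", "c", "d"}
edges = {("a", "b"): 1.0, ("c", "d"): 1.0}  # two disconnected pairs

before = density(team, edges)
edges[("b", "c")] = 1.0                      # one new collaboration inside the team
after = density(team, edges)

print(before, after, after - before)         # the change equals 1.0 / len(team)
```

In the same example the induced subgraph is disconnected before the edge is added, so its diameter is infinite, and afterwards it is the path a, b, c, d with diameter 3: a single edge moves the diameter from infinite to finite while the density changes by only 0.25.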

The properties for density based objectives fall out of the fact that adding or deleting edges only gradually 
alters the density of a solution subgraph. Diameter based objectives (or even the minimum spanning tree based 
objective suggested in [12] that we do not consider in this paper) are not smooth in this sense; altering the 
graph slightly can change the objective radically. These properties make density based objectives somewhat 
more suitable. One drawback of density as an objective, however, is that the optimal
solution may contain disconnected components. This is not the case for the diameter-based
objective; however, although the solution it returns is connected, it may be large and include non-skilled
(undesired) nodes that are needed only to ensure connectivity. To promote connectivity in the
density-based solutions, we suggest several heuristic algorithms in the experimental section.

Ultimately, the quality of the teams produced under different definitions (that is, their potential for
collaboration) needs to be evaluated using measures that are neutral to these definitions; we make such objective comparisons in the
experimental section.

4 Density-based objective 

In this section, we claim that Density-sTF and Density-mTF are NP-hard problems. We then present the
algorithms s-DensestAlk (Algorithm 1) and m-DensestAlk (Algorithm 2) for Density-sTF and Density-mTF,
respectively. Further, we prove that s-DensestAlk achieves a 3-approximation factor for Density-sTF.

Theorem 1 Density-sTF and Density-mTF problems are NP-hard.

Proof: We prove the claim by a reduction from the Densest at least k subgraph (DalkS) problem defined in
[10]. An instance of DalkS consists of a graph $G(X, E)$ and a constant $k$, and the solution is a maximum
density subgraph with at least $k$ nodes. We transform it into an instance of the Density-sTF problem by defining
a skill $a$ for every node $v \in V$, in which case a solution would be a maximum density subgraph with at least
$k$ nodes that have skill $a$. Since skill $a$ is defined for every node in $G$, it is easy to see that $X' \subseteq X$ is a
solution to the Density-sTF problem iff it is a solution to the DalkS problem. The Density-sTF problem is a
special case of Density-mTF, which implies that Density-mTF is also NP-hard. ∎
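
The reduction can be sketched in a few lines of Python: every node receives the same skill, the graph is left unchanged, and the task asks for $k$ nodes of that skill. The function names and the dictionary encoding are assumptions for illustration; the loop at the end simply checks, on a toy instance, that a node set is feasible for the constructed Density-sTF instance exactly when it has at least $k$ nodes.

```python
from itertools import combinations


def dalks_to_density_stf(nodes, k, skill="a"):
    """Map a DalkS instance (same graph, size bound k) to a Density-sTF instance."""
    skills = {v: {skill} for v in nodes}  # every node is skilled in `skill`
    task = {skill: k}                     # the task requires k nodes of that skill
    return skills, task


def is_feasible(team, skills, task):
    """True if the team has enough members for every (skill, count) requirement."""
    return all(sum(1 for v in team if a in skills[v]) >= need for a, need in task.items())


nodes = [1, 2, 3, 4]
skills, task = dalks_to_density_stf(nodes, k=2)

for size in range(len(nodes) + 1):
    for team in combinations(nodes, size):
        # Feasible for Density-sTF exactly when the subgraph has at least k nodes.
        assert is_feasible(team, skills, task) == (len(team) >= 2)
```

Because the feasible sets coincide and the objective (density) is the same, the optimal solutions of the two instances coincide as well.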

4.1 3-approximation algorithm for Density-sTF 

Intuition: To begin with, the algorithm s-DensestAlk (Algorithm 1) accepts the graph and the skill requirements
as an input. It then finds the densest subgraph and removes it from the input graph and adds it to
the solution subgraph (which is initially empty). It then checks if the solution subgraph satisfies the skill
requirements. Until the solution subgraph constructed meets the skill requirements, the algorithm continues
to iterate through the process of finding the densest subgraph from the remaining input graph and adding it
to the solution subgraph. Since in each iteration the algorithm adds the densest subgraph, it is ensured that
the solution subgraph has sufficiently high density. Note that although we are able to prove that the algorithm
guarantees a 3-approximation ratio in terms of density, no bound on the size is guaranteed. We overcome
this drawback by applying various simple heuristic algorithms which are described later in Section 6.2.

Details: The algorithm s-DensestAlk($G$, $T$) takes as input the social graph $G$ and a task $T = \{\langle a, k \rangle\}$
where at least $k$ individuals/nodes of skill $a$ are required to perform the task $T$. As explained intuitively,
the algorithm then proceeds through multiple iterations. In each iteration $i$, it finds the maximum density
subgraph of $G_i$, say $H_{i+1}$, removes it from $G_i$ using the routine shrink($G_i$, $H_{i+1}$) and constructs a new
solution subgraph $D_{i+1}$ using the routine union($D_i$, $H_{i+1}$). The routine shrink($G$, $H$) removes $H$ from $G$
such that for each $v \in (G - H)$, if $v$ has $l$ edges to the vertices in $H$, then it adds $l$ self-loops to $v$ with the
corresponding edge weights. Inside the routine union($D$, $H$), for each self-loop, we look at its corresponding
edge, say $e(u, v)$, in the original input graph $G$, and if $u \in D$, $v \in H$ (or vice versa), we replace the loop by
the edge $e(u, v)$. Finally, once the loop-termination condition is satisfied, the algorithm examines each
of the intermediate solution subgraphs $D_i$ constructed in previous iterations and adds a sufficient number of
skilled nodes to it so that each $D_i$ satisfies the skill requirement. The algorithm then picks the one with
the highest density as the final solution subgraph.

Our algorithm is very similar to the DensestAtleastK algorithm in [10] that calculates the maximum
density subgraph containing at least k vertices without any skill constraints imposed. The naive extension 
would be to just add k skilled nodes to the solution returned by algorithm DensestAtleastK. And since 
their algorithm guarantees an approximation factor of 2 for density, this naive extension would guarantee 
an approximation factor of 4 (proof omitted for brevity). But, since the additional k nodes are picked at 
random the solution may suffer from many disconnected components making it practically infeasible to be 
of any use. Therefore, we propose the algorithm s-DensestAlk that differs mainly in the loop-termination 
condition imposed. This condition ensures that the resulting solution satisfies the constraints of at least k 
skilled nodes, improves the approximation ratio to 3 from 4, and has good connectivity properties. 

Although the proof for the 4-approximation is simple, it turns out that proving a 3-approximation for Density-sTF
is significantly harder. While the algorithm is simple, the analysis is fairly detailed. The key idea is to
consider various cases about the returned subgraph and carefully examine the density of each component.
The analysis is similar to [10] at a high level. However, due to the skill-set constraints, several sub-cases
need to be considered.



Algorithm 1 s-DensestAlk($G$, $T$)

1: $D_0 \leftarrow \emptyset$, $G_0 \leftarrow G$, $i \leftarrow 0$
2: while $|D_i \cap S(a)| < k$ where $T = \{\langle a, k \rangle\}$ do
3:     $H_{i+1} \leftarrow$ maximum-density-subgraph($G_i$)
4:     $D_{i+1} \leftarrow$ union($D_i$, $H_{i+1}$)
5:     $G_{i+1} \leftarrow$ shrink($G_i$, $H_{i+1}$)
6:     $i \leftarrow i + 1$
7: end while
8: for each $D_i$ do
9:     $n_a \leftarrow$ number of nodes of skill $a$ in $D_i$
10:     Add $\max(k - n_a, 0)$ nodes of skill $a$ to $D_i$ to form $D'_i$
11: end for
12: Return the $D'_i$ which has the maximum density
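
The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the maximum-density-subgraph routine is replaced by an exhaustive search that is only usable on tiny graphs, self-loops created by shrink are stored as one accumulated weight per node, padding nodes are chosen in sorted order, edge weights are assumed non-negative, and all function names are hypothetical.

```python
from itertools import combinations


def induced_weight(nodes, edges):
    """Total weight of the edges with both endpoints in `nodes`."""
    return sum(w for (u, v), w in edges.items() if u in nodes and v in nodes)


def densest_subgraph_bruteforce(remaining, edges, loops):
    """Exhaustive stand-in for maximum-density-subgraph(G_i); tiny graphs only."""
    best, best_density = None, -1.0
    ordered = sorted(remaining)
    for size in range(1, len(ordered) + 1):
        for subset in combinations(ordered, size):
            chosen = set(subset)
            weight = induced_weight(chosen, edges) + sum(loops[v] for v in chosen)
            if weight / size > best_density:
                best, best_density = chosen, weight / size
    return best


def s_densest_alk(nodes, edges, skilled, k):
    """Sketch of s-DensestAlk for the task {<a, k>}; `skilled` plays the role of S(a)."""
    skilled = set(skilled) & set(nodes)
    if k < 1 or len(skilled) < k:
        raise ValueError("need 1 <= k <= number of skilled nodes")
    remaining = set(nodes)            # vertex set of G_i
    loops = {v: 0.0 for v in nodes}   # self-loop weight accumulated by shrink
    solution, prefixes = set(), []    # D_i and the list D_1, D_2, ...
    while len(solution & skilled) < k:
        densest = densest_subgraph_bruteforce(remaining, edges, loops)  # H_{i+1}
        solution = solution | densest                                   # union
        prefixes.append(solution)
        remaining -= densest                                            # shrink
        for (u, v), w in edges.items():
            if u in densest and v in remaining:
                loops[v] += w
            elif v in densest and u in remaining:
                loops[u] += w
    best_team, best_density = None, -1.0
    for prefix in prefixes:
        missing = max(k - len(prefix & skilled), 0)
        team = prefix | set(sorted(skilled - prefix)[:missing])  # pad to form D'_i
        team_density = induced_weight(team, edges) / len(team)
        if team_density > best_density:
            best_team, best_density = team, team_density
    return best_team, best_density


# Toy instance: a heavy triangle {1, 2, 3} with a skilled tail 4 - 5.
edges = {(1, 2): 2.0, (2, 3): 2.0, (1, 3): 2.0, (3, 4): 1.0, (4, 5): 1.0}
team, value = s_densest_alk([1, 2, 3, 4, 5], edges, skilled={4, 5}, k=1)
print(team, value)
```

On this toy instance the first iteration extracts the triangle, the second extracts node 4 (whose self-loop records its edge to node 3), and the returned team is the triangle plus node 4.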



Theorem 2 The algorithm s-DensestAlk achieves an approximation factor of 3 for the Density-sTF problem. 

Proof: Let $H^*$ denote an optimal solution and $d^* = \frac{W(H^*)}{|V(H^*)|}$ denote the density of the optimal solution.

If the number of iterations is 1, then $H_1$ is the maximum density subgraph that contains at least $k$ nodes
of skill $a$. Therefore, $H^* = H_1$ and the algorithm returns it. Otherwise, say the algorithm iterates for $l \ge 2$
rounds. There can be two cases:
Case 1: There exists an $l' < l$ such that
$W(D_{l'-1} \cap H^*) < \frac{W(H^*)}{2}$ and $W(D_{l'} \cap H^*) \ge \frac{W(H^*)}{2}$.

Case 2: There exists no such $l' < l$.
Figure 1: $D_{l'} = D_{i1} \cup D_{i2} \cup X$. Panel labels: $G' = D_{l'} \cap H^* = D_{i1} \cup G''$; $X = $ shrink($D_{l'}$, $D_i$); $D_{i1} = D_i \cap G'$; $D_{i2} = $ shrink($D_i$, $D_{i1}$); $G'' = $ shrink($G'$, $D_{i1}$).



Before analyzing the two cases in detail, note that by construction density($H_i$) $\le$ density($D_i$) $\le$ density($D_{i-1}$).
We now consider Case 2 first and later Case 1.
Proof for Case 2. 

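Since the algorithm terminates after $l$ iterations, $D_l$ contains at least $k$ nodes of skill $a$. Further, we know
that for each $j \le l-1$, $W(D_j \cap H^*) < \frac{W(H^*)}{2}$

$\Rightarrow W(G_j \cap H^*) \ge \frac{W(H^*)}{2}$

$\Rightarrow \frac{W(G_j \cap H^*)}{|V(G_j \cap H^*)|} \ge \frac{W(H^*)}{2|V(H^*)|}$

$\Rightarrow$ $G_j$ contains a subgraph of density $\ge \frac{d^*}{2}$

$\Rightarrow$ density($H_l$) $\ge \frac{d^*}{2}$

$\Rightarrow$ density($D_l$) $\ge \frac{d^*}{2}$

Thus, $D_l$ has density $\ge \frac{d^*}{2}$ and contains at least $k$ nodes of skill $a$. Therefore, the algorithm indeed returns
a subgraph of density at least $\frac{d^*}{2}$.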
Proof for Case 1 

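$W(D_{l'-1} \cap H^*) < \frac{W(H^*)}{2}$ and $W(D_{l'} \cap H^*) \ge \frac{W(H^*)}{2}$

$\Rightarrow W(G_{l'} \cap H^*) \ge \frac{W(H^*)}{2}$ where $G_{l'} = $ shrink($G$, $D_{l'-1}$)

$\Rightarrow \frac{W(G_{l'} \cap H^*)}{|V(G_{l'} \cap H^*)|} \ge \frac{W(H^*)}{2|V(H^*)|} = \frac{d^*}{2}$

$\Rightarrow$ $G_{l'}$ has a subgraph of density $\ge \frac{d^*}{2}$

$\Rightarrow$ density($H_{l'}$) $\ge \frac{d^*}{2}$ ($H_{l'}$ is the densest subgraph of $G_{l'}$)

$\Rightarrow$ density($D_{l'}$) $\ge \frac{d^*}{2}$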

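Now, let us divide Case 1 into the following 4 parts.

(a) $|V(D_{l'})| \le \frac{k}{2}$

According to step 10, the algorithm adds at most $k$ vertices to $D_{l'}$ to obtain the subgraph, say $D'_{l'}$, with
density $d$:

$d \ge \frac{W(D_{l'})}{|V(D_{l'})| + k} \ge \frac{\frac{1}{2} W(H^*)}{\frac{k}{2} + k} \ge \frac{W(H^*)}{3|V(H^*)|} = \frac{d^*}{3}$

(b) $|V(D_{l'})| \ge 2k$

According to step 10, the algorithm adds at most $k$ vertices to $D_{l'}$. Further, we know that density($D_{l'}$) $\ge \frac{d^*}{2}$;
therefore, the resulting subgraph, $D'_{l'}$, has density

$d \ge \frac{W(D_{l'})}{|V(D_{l'})| + k} \ge \frac{W(D_{l'})}{\frac{3}{2}|V(D_{l'})|} \ge \frac{d^*}{3}$

(c) $\frac{k}{2} < |V(D_{l'})| < 2k$ and $|V(D_{l'}) \cap V(H^*)| \ge \frac{|V(H^*)|}{2}$

According to step 10, the algorithm adds at most $\frac{|V(H^*)|}{2}$ nodes to $D_{l'}$ to form $D'_{l'}$, with density, say $d$.

i $|V(D_{l'})| \ge |V(H^*)|$:

$d \ge \frac{W(D_{l'})}{|V(D_{l'})| + \frac{|V(H^*)|}{2}} \ge \frac{W(D_{l'})}{\frac{3}{2}|V(D_{l'})|} \ge \frac{d^*}{3}$

ii $|V(D_{l'})| < |V(H^*)|$:

$d \ge \frac{W(D_{l'})}{|V(D_{l'})| + \frac{|V(H^*)|}{2}} \ge \frac{\frac{1}{2} W(H^*)}{|V(H^*)| + \frac{|V(H^*)|}{2}} = \frac{W(H^*)}{3|V(H^*)|} = \frac{d^*}{3}$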
Figure 2: $D_{i1} = D_i \cap G'$, $D_{i2} = $ shrink($D_i$, $D_{i1}$), $G'' = $ shrink($G'$, $D_{i1}$)



(d) $\frac{k}{2} < |V(D_{l'})| < 2k$ and $|V(D_{l'}) \cap V(H^*)| < \frac{|V(H^*)|}{2}$

If $d_{l'} = $ density($D_{l'}$) $\ge d^*$, then adding at most $k$ vertices gives a subgraph $D'_{l'}$, with density, say $d$, such
that

$d \ge \frac{W(D_{l'})}{|V(D_{l'})| + k} \ge \frac{W(D_{l'})}{|V(D_{l'})| + 2|V(D_{l'})|} = \frac{W(D_{l'})}{3|V(D_{l'})|} \ge \frac{d^*}{3}$

Therefore, $D'_{l'}$ is a subgraph that contains at least $k$ nodes of skill $a$ and has density $d \ge \frac{d^*}{3}$. We are
done here.

Now, assume that $d_{l'} < d^*$.

In the rest of the proof, we divide $D_{l'}$ into subgraphs as explained below and shown in Figure 1.
Let $G' = D_{l'} \cap H^*$.

Claim 1 $W(G') \ge \frac{W(H^*)}{2}$ and density($G'$) $\ge d^*$.

Proof: $|V(G')| = |V(D_{l'} \cap H^*)| < \frac{|V(H^*)|}{2}$ and $W(G') = W(D_{l'} \cap H^*) \ge \frac{W(H^*)}{2}$

$\Rightarrow$ density($G'$) $\ge \frac{W(H^*)/2}{|V(H^*)|/2} = d^*$. ∎

Define $i$ such that density($H_i$) $\ge d^*$ and density($H_{i+1}$) $< d^*$. Such an $i < l'$ exists due to Claim 1 and
since $d_{l'} < d^*$.

density($D_i$) $= d_i \ge d^*$.
Let $n_i = |V(D_i)|$. We now consider two sub-cases.

i $n_i \ge \frac{|V(H^*)|}{2}$: Add at most $k$ vertices to $D_i$ to get a subgraph $D'_i$ with density($D'_i$) $= d$, such that

$d \ge \frac{W(D_i)}{|V(D_i)| + k} \ge \frac{W(D_i)}{|V(D_i)| + |V(H^*)|} \ge \frac{W(D_i)}{3|V(D_i)|} \ge \frac{d^*}{3}$

Thus, $D'_i$ is a subgraph containing at least $k$ nodes of skill $a$ and density $d \ge \frac{d^*}{3}$, and we are done here.

ii $n_i < \frac{|V(H^*)|}{2}$: We know that density($G'$) $\ge d^*$, density($H_i$) $\ge d^*$ and density($H_{i+1}$) $< d^*$. Therefore,
$G' \cap D_i \ne \emptyset$. We now introduce a few definitions and prove claims about them.

Let $D_{i1} = D_i \cap G'$, $D_{i2} = $ shrink($D_i$, $D_{i1}$), and $G'' = $ shrink($G'$, $D_{i1}$) (Figure 2). Further, let
$X = $ shrink($D_{l'}$, $D_i$).



Claim 2 $W(D_{i1}) \ge \frac{W(H^*)}{2} - W(G'')$.

Proof: $W(G') = W(G'') + W(D_{i1})$ since $G'' = $ shrink($G'$, $D_{i1}$); but $W(G') \ge \frac{W(H^*)}{2}$ (using Claim 1). ∎

Claim 3 density($D_{i2}$) $\ge \frac{d^*}{2}$.

Proof: Recall that for each $j \le i$, density($H_j$) $\ge d^*$. Further, $H_j$ is the densest subgraph of shrink($G$, $D_{j-1}$).
Therefore, for each $v \in H_j$, the degree of $v$ induced in $H_j$ is at least $d^*$. Therefore, for all $v \in D_i$,
degree($v$) $\ge d^*$ (here we abuse notation to denote $v$'s degree induced in $D_i$ by degree($v$)). ∎

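For convenience, let $n_x = |V(X)|$, $n_{l'} = |V(D_{l'})|$, $n_{i1} = |V(D_{i1})|$, $n_{i2} = |V(D_{i2})|$, and $n'' = |V(G'')|$.

Claim 4 $W(X) - W(G'') \ge \frac{d^*}{2}(n_x - n'')$.

Proof: Since $H_{l'}$ is the maximum density subgraph of shrink($G$, $D_{l'-1}$), density($H_{l'}$) $\ge$ density($S$)
for any $S \subseteq H_{l'}$. Further, since $X = $ shrink($D_{l'}$, $D_i$), and $G'' = $ shrink($G'$, $D_i \cap G'$), we have
$G'' \subseteq X$. Therefore, density($H_j$) $\ge$ density($H_j \cap G''$) (for all $i < j \le l'$).
Therefore, $W(X) - W(G'')$

$= \sum_{j=i+1}^{l'} W(H_j) - \sum_{j=i+1}^{l'} W(H_j \cap G'')$

$\ge \sum_{j=i+1}^{l'}$ density($H_j$) $(|H_j| - |H_j \cap G''|)$

$\ge \frac{d^*}{2}(n_x - n'')$, since density($H_j$) $\ge$ density($H_{l'}$) $\ge \frac{d^*}{2}$ for every $j \le l'$. ∎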

Notice that we have (lower) bounded the density or the weight of each of $D_{i1}$, $D_{i2}$, and $X$, the three
components that add up to $D_{l'}$. We are now ready to argue about the density of $D_{l'}$ when $k$ vertices
are added to it. Before initiating this analysis, we briefly state a claim relating the sizes of these
components.

Claim 5 $n_{i2} + n_x - n'' \ge n_{l'} - \frac{|V(H^*)|}{2}$.

Proof: This follows using $|V(G')| < \frac{|V(H^*)|}{2}$ and the definition $G'' = $ shrink($G'$, $D_{i1}$). ∎
We now complete the analysis. 
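$d = $ density($D'_{l'}$) $\ge \frac{W(D_{l'})}{n_{l'} + k}$

$= \frac{W(D_i) + W(X)}{n_{l'} + k} = \frac{W(D_{i1}) + W(D_{i2}) + W(X)}{n_{l'} + k}$

$\ge \frac{\frac{W(H^*)}{2} - W(G'') + \frac{d^*}{2} n_{i2} + W(X)}{n_{l'} + k}$ (using Claims 2 and 3)

$\ge \frac{d^*}{2} \cdot \frac{|V(H^*)| + n_{i2} + n_x - n''}{n_{l'} + k}$ (using Claim 4)

$\ge \frac{d^*}{2} \cdot \frac{n_{l'} + \frac{|V(H^*)|}{2}}{n_{l'} + k}$ (using Claim 5)

$\ge \frac{d^*}{3}$ (since $\frac{k}{2} < n_{l'}$ and $|V(H^*)| \ge k$).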

Remark: Cases (c) and (d) do not use the bound $|V(D_{l'})| < 2k$; so they together subsume case (b), but we
have presented (b) for clarity. ∎



4.2 Algorithm for Density-mTF 

In this section, we present the algorithm m-DensestAlk (Algorithm 2) for the Density-mTF problem. This
is an extension of the algorithm s-DensestAlk for the Density-sTF problem described earlier. The algorithm
m-DensestAlk accepts the input parameters: graph $G$ and task $T = \{\langle a_1, k_1 \rangle, \langle a_2, k_2 \rangle, \ldots, \langle a_m, k_m \rangle\}$,
which requires at least $k_i$ individuals of skill $a_i$ to perform the task $T$. Each iteration within the algorithm
m-DensestAlk is identical to that of s-DensestAlk described earlier, except that here the termination condition
verifies that the solution subgraph contains at least $k_i$ nodes with skill $a_i$ for $i \in \{1, \ldots, m\}$, thus satisfying
the multiple skill requirement instead of the single skill requirement. The details of the algorithm are similar to
those described for s-DensestAlk in Section 4.1.

Theorem 3 The algorithm m-DensestAlk achieves an approximation factor of 3 for the special case of 
Density-mTF problem where each node in the graph has at most one skill. 

Algorithm 2 m-DensestAlk(G, T) 

1: $D_0 \leftarrow \emptyset$, $G_0 \leftarrow G$, $i \leftarrow 0$ 
2: while $|D_i \cap S(a_j)| < k_j$ for any $\langle a_j, k_j \rangle \in T$ do 
3:     $H_{i+1} \leftarrow$ maximum-density-subgraph($G_i$) 
4:     $D_{i+1} \leftarrow$ union($D_i$, $H_{i+1}$) 
5:     $G_{i+1} \leftarrow$ shrink($G_i$, $H_{i+1}$) 
6:     $i \leftarrow i + 1$ 
7: end while 
8: for each $D_i$ do 
9:     $D'_i \leftarrow D_i$ 
10:    for each $\langle a_j, k_j \rangle \in T$ do 
11:        $n_{a_j}$ = number of nodes of skill $a_j$ in $D_i$ 
12:        Add $\max(k_j - n_{a_j}, 0)$ nodes of skill $a_j$ to $D'_i$ 
13:    end for 
14: end for 
15: Return the $D'_i$ which has the maximum density 

Proof: Let $m = |T|$ and $k = \sum_{j=1}^{m} k_j$, where $k_j$ individuals of skill $a_j$ are required, i.e., 
$\langle a_j, k_j \rangle \in T$. Since each node contributes to at most one skill, an optimal solution, $H^*$, 
has at least $k$ vertices. The proof for m-DensestAlk is analogous to the proof for s-DensestAlk, with the only 
difference that instead of adding any $k$ nodes of skill $a$ to each $D_i$, we add $k_j$ nodes of skill $a_j$ 
for each $\langle a_j, k_j \rangle \in T$. ∎ 

We are unable to bound the performance of m-DensestAlk for the general case of Density-mTF problem. 
Further, the time complexity of m-DensestAlk is 0{kn^) which can be inefficient for very large graphs but 
is manageable at the scale at which we run experiments. Directly using the linear time algorithm for the 
densest at least k subgraph problem in [TlfTO| or 0(n'^ )-time algorithm from [l][TO] for the Density-sTF problem 
would result in a weaker bound, i.e., a 6-approximation and a 4-approximation, respectively. In both cases, 
however, the solution may consist of many disconnected components. 

5 Diameter-based objective 

In this section, we state theoretical results for Diameter-sTF and Diameter-mTF. We show that these 
problems are NP-hard (note that the NP-hardness of Diameter-sTF does not follow from any previous work). 
We further present an algorithm MinDiameter (Algorithm 3), which is an extension of RarestFirst in [12], 
and prove that it achieves a 2-approximation factor. 



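Algorithm 3 MinDiameter(G, T) 

1: for each $\langle a, k \rangle \in T$ do 
2:     $S(a) \leftarrow \{i \mid a \in X_i\}$ 
3: end for 
4: $a_{rare} \leftarrow \arg\min_{\langle a, k \rangle \in T} |S(a)|$ 
5: for each $i \in S(a_{rare})$ do 
6:     for each $\langle a, k \rangle \in T$ do 
7:         $R_{ia} \leftarrow d_k(i, S(a), k)$ 
8:     end for 
9:     $R_i \leftarrow \max_a R_{ia}$ 
10: end for 
11: $i^* \leftarrow \arg\min_i R_i$ 
12: $X' = \{i^*\}$ 
13: for each $\langle a, k \rangle \in T$ do 
14:     $X' = X' \cup \{Path_k(i^*, S(a), k)\}$ 
15: end for 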



Theorem 4 Diameter-sTF and Diameter-mTF problems are NP-complete. 

Proof: The problems Diameter-sTF and Diameter-mTF are in NP because, for a given candidate solution, it can be 
verified in polynomial time that the skill-set requirement is satisfied and that its diameter is within the 
given bound. We prove that Diameter-sTF is NP-hard by reduction from the 3-satisfiability problem. Consider a 
3-SAT instance, say $\Phi = C_1 \wedge C_2 \wedge \ldots \wedge C_M$, where each clause $C_j = (x \vee y \vee z)$ 
and $\{x, y, z\} \subseteq U = \{u_1, \neg u_1, u_2, \neg u_2, \ldots, u_N, \neg u_N\}$. Let 
$C = \{C_1, C_2, \ldots, C_M\}$. Let $N$, $M$ denote the number of variables and clauses, respectively. We 
construct an instance of the Diameter-sTF problem corresponding to the 3-SAT instance $\Phi$ using the 
following rules. 

Rule 1 For each variable $x$, create two nodes $x$, $\neg x$ in $G$ and set $w(x, \neg x) = r'$. 

Rule 2 For each clause $C_j$, create two nodes, $C_{j1}$ and $C_{j2}$ in $G$ and set $w(C_{j1}, C_{j2}) = r'$. 

Rule 3 Pick any $r$ such that $r < r'$. For each pair of variables $(x, y)$ where $y \neq \neg x$, set 
$w(x, y) = r$. Similarly, for each pair of clause nodes $(C_f, C_g)$, where $w(C_f, C_g)$ is not set by rule 2, 
set $w(C_f, C_g) = r$. 

Rule 4 For each clause $C_j = (x \vee y \vee z)$, set 

$w(C_{j1}, x) = w(C_{j1}, y) = w(C_{j1}, z) = \frac{r}{2}$ and 

$w(C_{j2}, x) = w(C_{j2}, y) = w(C_{j2}, z) = \frac{r}{2}$ and 

$w(C_{j1}, u) = w(C_{j2}, u) = r$ for each $u \in U - \{x, y, z\}$. 

Rule 5 For each $u_i \in U$, associate a skill $a$ with the nodes $u_i$, $\neg u_i$. And for each $C_j \in C$, 
associate a skill $a$ with the nodes $C_{j1}$, $C_{j2}$. 
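The five rules above can be sketched in code. The following is a minimal illustration, not the authors' implementation: the function name `build_reduction_graph`, the integer encoding of literals (+v for a variable, -v for its negation), the tuple encoding of clause nodes, and the default values of `r` and `r_prime` are all assumptions made for the example. It also assumes that the clause-to-own-literal weight in Rule 4 is r/2.

```python
from itertools import combinations


def build_reduction_graph(clauses, num_vars, r=2.0, r_prime=3.0):
    """Return (nodes, weights) for the Diameter-sTF instance of a 3-SAT formula.

    clauses: list of 3-tuples of non-zero ints (+v is variable v, -v its negation).
    weights maps frozenset({p, q}) to the edge weight w(p, q).
    Every node carries the single skill a (Rule 5), so no skill map is returned.
    """
    if not r < r_prime:
        raise ValueError("Rule 3 requires r < r'")
    literals = [sign * v for v in range(1, num_vars + 1) for sign in (1, -1)]
    clause_nodes = [("C", j, side) for j in range(len(clauses)) for side in (1, 2)]
    weights = {}
    # Rules 1 and 3: a literal and its negation are r' apart, other literal pairs r.
    for x, y in combinations(literals, 2):
        weights[frozenset((x, y))] = r_prime if x == -y else r
    # Rules 2 and 3: the two nodes of one clause are r' apart, other clause pairs r.
    for p, q in combinations(clause_nodes, 2):
        weights[frozenset((p, q))] = r_prime if p[1] == q[1] else r
    # Rule 4: r/2 (assumed value) to the clause's own literals, r to all others.
    for c in clause_nodes:
        own = set(clauses[c[1]])
        for u in literals:
            weights[frozenset((c, u))] = r / 2 if u in own else r
    return literals + clause_nodes, weights
```

For a formula with N variables and M clauses the construction yields 2N + 2M nodes and a complete weighted graph, so it is polynomial in the size of the formula.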

Claim 6 In $G$, $d(x, \neg x) > r$ where $x, \neg x \in U$. 

Proof: In $G$, for each variable $y$ ($y \neq x, \neg x$), $d(x, y) = d(\neg x, y) = r$ and 
$w(x, \neg x) = r' > r$ (rules 1, 3). Further, both $x$ and $\neg x$ cannot appear together in any clause 
$C_j \in C$ (pre-processing). Therefore, in $G$, $d(C_{j1}, x) = d(C_{j2}, x) = \frac{r}{2}$ and 
$d(C_{j1}, \neg x) = d(C_{j2}, \neg x) = r$ (rules 3, 4). 

$\Rightarrow d(x, \neg x) > r$. ∎ 

Claim 7 Let $X$ be a subgraph of $G$ and let $V(X)$ denote the nodes in $X$. Let $C_{j1}, C_{j2} \in V(X)$ where 
$C_j = (x \vee y \vee z)$. Then, in $X$, $d(C_{j1}, C_{j2}) = r$ iff $V(X) \cap \{x, y, z\} \neq \emptyset$. 

Proof: Assume $V(X) \cap \{x, y, z\} = \emptyset$. 

In $G$, for each clause node $C_f$ ($C_f \neq C_{j1}, C_{j2}$), $d(C_{j1}, C_f) = d(C_{j2}, C_f) = r$ and 
$w(C_{j1}, C_{j2}) = r' > r$ (rules 2, 3). Further, for each $u \in U - \{x, y, z\}$, 
$d(C_{j1}, u) = d(C_{j2}, u) = r$ (rule 4). Therefore, in $X$, $d(C_{j1}, C_{j2}) > r$. 
However, this is a contradiction because, in $X$, $d(C_{j1}, C_{j2}) = r$. 

$\Rightarrow V(X) \cap \{x, y, z\} \neq \emptyset$. ∎ 

Claim 8 Let $k = N + 2M$. If $\Phi$ has a satisfying assignment then $G$ has a subgraph $X'$ with 
$|X' \cap S(a)| \geq k$ and $diameter(X') \leq r$. 

Proof: If $\Phi$ has a satisfying assignment, then $G$ has a subgraph $X'$ such that $X'$ contains $C_{j1}$, 
$C_{j2}$ for each clause $C_j \in C$, and $u$ (or $\neg u$) $\in U$ if $u$ (or $\neg u$) is set to 1 in the 
satisfying assignment for $\Phi$. Note that in the satisfying assignment for $\Phi$ exactly one of $u$ and 
$\neg u$ is set to 1. Thus, $X'$ contains exactly $N$ literal nodes and two nodes per clause. Thus, 
$|X' \cap S(a)| = N + 2M = k$ (rule 5). 

Since $X'$ contains either a variable or its negation, for each pair of variables $(x, y) \in V(X') \cap U$, 
$d(x, y) = r$ (rule 3). Further, in the satisfying assignment for $\Phi$ each clause $C_j = x \vee y \vee z$ has 
at least one of its literals set to 1. So, for each pair of clause nodes $(p, q)$ in $V(X')$, $d(p, q) = r$ 
(Claim 7 and rule 3). Therefore, the distance between any two nodes in $X'$ is at most $r$ (rule 4). 

Thus, if $\Phi$ has a satisfying assignment then $G$ has a subgraph $X'$ with $|X' \cap S(a)| = k$ and 
$diameter(X') = r$. ∎ 

Claim 9 Let $k = N + 2M$. If $G$ has a subgraph $X'$ with $|X' \cap S(a)| \geq k$ and $diameter(X') \leq r$ 
then $\Phi$ has a satisfying assignment. 

Proof: If $diameter(X') \leq r$ then $X'$ contains either $u$ or $\neg u$ but not both, because 
$d(u, \neg u) > r$ (Claim 6). Since $k = N + 2M$, for each variable $u \in U$, $X'$ contains a node 
corresponding to either $u$ or $\neg u$ (not both) and for each clause $C_j \in C$, $X'$ contains nodes 
corresponding to $C_{j1}$ and $C_{j2}$ (rule 5). Now, since $diameter(X') \leq r$, it follows that 
$d(C_{j1}, C_{j2}) \leq r$. This implies that at least one of the nodes corresponding to $x$, $y$, $z$ in $C_j$ 
is included in the subgraph $X'$ (Claim 7). Now, if each literal $u \in U \cap V(X')$ is set to 1, then $\Phi$ 
has a satisfying assignment. ∎ 

Claims 8 and 9 prove that Diameter-sTF is NP-hard. Since Diameter-sTF is the special case of Diameter-mTF, its 
NP-hardness proof follows. ∎ 



Theorem 5 For any graph distance function $d$ that satisfies the triangle inequality, the algorithm 
MinDiameter achieves an approximation factor of 2 for the Diameter-sTF and Diameter-mTF problems. 

Proof: The analysis we present here is similar to the analysis of the RarestFirst algorithm presented in 
[12]. First, consider the solution $X'$ output by the MinDiameter algorithm, and let $a_{rare} \in T$ be the 
skill possessed by the least number of individuals in $X$. Also, let $i^*$ be the individual picked from the 
set $S(a_{rare})$ to be included in the solution $X'$. Now, consider two other skills 
$a_1 \neq a_2 \neq a_{rare}$ and individuals $i, i' \in X$ such that $i \in S(a_1)$, $i \notin S(a_2)$ and 
$i' \notin S(a_1)$, $i' \in S(a_2)$. If $i$, $i'$ are part of the team reported by the MinDiameter algorithm, 
it means that $d(i^*, i) \leq d_k(i^*, S(a_1), k_1)$ and $d(i^*, i') \leq d_k(i^*, S(a_2), k_2)$. Due to the 
way the algorithm operates, we can lower-bound the Cc-R cost of the optimal solution, $X^*$, as follows: 

$d(i^*, i) \leq \text{Cc-R}(X^*)$ and $d(i^*, i') \leq \text{Cc-R}(X^*)$    (1) 

Since we have assumed that the distance function $d$ satisfies the triangle inequality, 

$d(i, i') \leq d(i, i^*) + d(i^*, i')$. 

By applying the bounds given in (1), we get the proposed approximation factor: 

$d(i, i') \leq \text{Cc-R}(X^*) + \text{Cc-R}(X^*) \leq 2 \cdot \text{Cc-R}(X^*)$. ∎ 

Algorithm MinDiameter is as follows. For each individual, say $i_r \in S(a_{rare})$, where $a_{rare}$ is the 
rarest skill (the skill with the minimum size support set $S$), and for each skill $a_i \in T$, the algorithm 
finds the distance to all the nodes in the support set $S(a_i)$. Then, for each support set $S(a_i)$, it 
chooses the $k_i$-size subset of $S(a_i)$ such that the maximum shortest path distance between $i_r$ and the 
nodes in this subset is minimum among all $k_i$-size subsets of $S(a_i)$. We call this distance the $k_i$-th 
shortest distance between $i_r$ and $S(a_i)$ and denote it by $d_k(i_r, S(a_i), k_i)$. Further, we denote the 
set of $k_i$ shortest paths between $i_r$ and each of the nodes belonging to the corresponding $k_i$-size 
subset of $S(a_i)$ by $Path_k(i_r, S(a_i), k_i)$. Thus, for each $i_r \in S(a_{rare})$ the algorithm has 
identified $k_i$ nodes of each skill $a_i$, thereby forming a possible solution team that satisfies the 
constraints. Finally, the algorithm picks the one of these solutions that has minimum diameter. The time 
complexity of the algorithm MinDiameter, assuming that all pairs shortest paths are pre-computed, is 0{n?). 
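The procedure just described can be sketched as follows. This is an illustrative simplification under stated assumptions rather than the paper's implementation: the function name `min_diameter` and the dictionary-based inputs are hypothetical, all-pairs distances are assumed to be precomputed (with `dist[v][v] == 0`), and the sketch returns only the selected endpoints, whereas Algorithm 3 also adds the intermediate nodes of the shortest paths to the team.

```python
def min_diameter(dist, skills, task):
    """Pick an anchor with the rarest skill and its k nearest holders of each skill.

    dist:   dict of dicts with precomputed shortest-path distances, dist[u][v].
    skills: dict mapping each node to its set of skills.
    task:   dict mapping each required skill to the required count k.
    Returns (anchor, team, radius), or None when some skill has too few holders.
    """
    support = {a: [v for v in skills if a in skills[v]] for a in task}
    if any(len(support[a]) < k for a, k in task.items()):
        return None
    a_rare = min(task, key=lambda a: len(support[a]))
    best = None
    for i in support[a_rare]:
        team, radius = {i}, 0.0
        for a, k in task.items():
            # The k holders of skill a closest to i; the last one realises d_k.
            nearest = sorted(support[a], key=lambda v: dist[i][v])[:k]
            team.update(nearest)
            radius = max(radius, dist[i][nearest[-1]])
        if best is None or radius < best[2]:
            best = (i, team, radius)
    return best
```

By the triangle inequality, any two members of the returned team are at most twice `radius` apart, which is the quantity that the proof of Theorem 5 bounds.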



6 Experiments 

In this section, we evaluate various team formation algorithms using the collaboration graph extracted from 
the DBLP bibliography server. We show that the density of the subgraphs returned by our algorithms 
s-DensestAlk and m-DensestAlk compares favorably with that of the algorithm MinDiameter. We also show 
that our algorithms for the density version provide high-quality results in terms of effective communication 
and collaboration (according to several metrics). In this section, we also present three simple heuristic 
extensions that can be used to process the solutions returned by s-DensestAlk and m-DensestAlk in order to 
further improve these solutions by reducing size and improving connectivity, while maintaining high density. 
Finally, examples of teams reported by our methods qualitatively corroborate the effectiveness of our framework. 

6.1 Experimental Setup 

We use a snapshot of the DBLP data downloaded on May 17, 2010 to create a benchmark data set for our 
experiments. We only consider the papers published in Database (DB), Data Mining (DM), Artificial 
Intelligence (AI) and Theory (T) conferences. We select papers from a total of 21 conferences categorized 
as follows: DB = {SIGMOD, VLDB, ICDE, ICDT, EDBT, PODS}, DM = {WWW, KDD, SDM, PKDD, ICDM}, 
AI = {ICML, ECML, COLT, UAI}, and T = {SODA, FOCS, STOC, STACS, ICALP, ESA}. We define the skill set 
$T = \{\text{T}, \text{AI}, \text{DB}, \text{DM}\}$. The set of skilled individuals $X_{dblp}$ consists of the 
set of authors with at least three papers in these domains. Two authors $i_1$, $i_2$ are connected in the graph 
$G_{dblp}(X_{dblp}, E)$ if they appear as co-authors in at least two papers in DBLP. The above procedure creates 
a set $X_{dblp}$ consisting of 6137 individuals. The maximum component size is 3869. We use this for all the 
experiments. The skill set $X_i$ of each such author $i$ is defined as 
$X_i = \{t \mid t \in T \text{ and } P_i(t) \neq \emptyset\}$, where $P_i(t)$ denotes the set of papers 
coauthored by $i$ that are published in the conferences in the domain $t$. 

Maximum Density Team Formation. To evaluate Algorithms 1 and 2, for each edge $e(i_1, i_2)$ we 
set the edge weight $w(i_1, i_2) = |P_{i_1} \cap P_{i_2}|$, where $P_{i_1}$ and $P_{i_2}$ represent the sets of 
papers published by $i_1$ and $i_2$ respectively. For the subgraph, say $G'(V', E')$, returned by these 
algorithms, we calculate the density. 



Minimum Diameter Team Formation. Here, we set the edge weight 
$w(i_1, i_2) = 1 - \frac{|P_{i_1} \cap P_{i_2}|}{|P_{i_1} \cup P_{i_2}|}$, as suggested in [12]. For 
comparison, when a subgraph $G'(V', E')$ is returned by MinDiameter, we compute its density by considering the 
induced subgraph on the vertices $V'$, say $G''$ (which could contain more edges than $E'$). The density 
calculated is $d'' = \frac{W(G'')}{|V(G'')|}$ with edge weights $w(i_1, i_2) = |P_{i_1} \cap P_{i_2}|$. 
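The two weighting schemes and the density computation can be illustrated with a short sketch. The names `coauthor_weights` and `density` and the toy input format are assumptions made for this example. Density is taken here to be the total edge weight of the induced subgraph divided by its number of nodes, and the `min_common` threshold mirrors the rule that two authors are connected only if they share at least two papers.

```python
from itertools import combinations


def coauthor_weights(papers_by_author, min_common=2):
    """Compute both edge weightings for every pair of connected authors.

    papers_by_author: dict mapping an author to the set of their paper ids.
    Returns (overlap, jaccard_dist), two dicts keyed by frozenset({a, b}):
    overlap holds |P_a & P_b| (density experiments) and jaccard_dist holds
    1 - |P_a & P_b| / |P_a | P_b| (diameter experiments).
    """
    overlap, jaccard_dist = {}, {}
    for a, b in combinations(sorted(papers_by_author), 2):
        pa, pb = papers_by_author[a], papers_by_author[b]
        common = len(pa & pb)
        if common < min_common:
            continue  # not co-authors often enough to be connected
        key = frozenset((a, b))
        overlap[key] = common
        jaccard_dist[key] = 1 - common / len(pa | pb)
    return overlap, jaccard_dist


def density(nodes, weights):
    """Total weight of the edges induced on `nodes`, divided by the node count."""
    nodes = set(nodes)
    total = sum(w for edge, w in weights.items() if edge <= nodes)
    return total / len(nodes)
```

Because `density` sums over every edge whose endpoints both lie in the node set, it evaluates the induced subgraph, which is how a team returned by MinDiameter is scored for comparison.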

6.2 Heuristic algorithms 

The objectives for Density-sTF and Density-mTF are to find subgraphs with maximum density satisfying the 
skill requirements. However, this does not necessitate a connected graph, and disconnectedness makes meaningful 
collaboration difficult in practice. This is an artifact of the objective function, rather than the algorithm. 
While the solutions returned by our algorithms s-DensestAlk and m-DensestAlk never had more than three 
components, we would like solutions with only one component. This is the motivation for the heuristic 
improvements. An additional benefit of our suggested heuristics is that we are able to reduce the number of 
nodes in the returned subgraph. The hope is that these can be achieved without compromising significantly on 
the density. 

Algorithm 4 EnhanceComponent(G'', T) 

1: (Note: $T = \{\langle a, k \rangle\}$) 
2: for each component $C_i \in G''$ do 
3:     $C'_i \leftarrow C_i$, $N_i \leftarrow N(C_i) - C_i$ 
4:     (note: $N(C_i)$ denotes neighbors of nodes in $C_i$) 
5:     for each node $v \in N_i$ do 
6:         if $|V(C'_i) \cap S(a)| \geq k$ then 
7:             $C' \leftarrow C' \cup C'_i$ 
8:             break for loop 
9:         end if 
10:        if $v \in S(a)$ then 
11:            $C'_i \leftarrow C'_i \cup v$ 
12:        end if 
13:    end for 
14: end for 



Algorithm 5 EnhancedDense(G, T) 

1: $G' \leftarrow$ s-DensestAlk($G$, $T$) 
2: $C' \leftarrow$ EnhanceComponent($G'$, $T$) 
3: Return $\arg\min_{C'_i \in C'} |C'_i|$ 



We present three heuristics. The starting point of each is the solution to Density-sTF or Density-mTF, 
as the case may be. We name these heuristics EnhancedDense (Algorithm 5), PartialTrimmedDense 
(Algorithm 6) and CompleteTrimmedDense (Algorithm 7). For simplicity, the algorithms are 
presented as extensions to s-DensestAlk, but they apply to m-DensestAlk analogously. The basic idea behind 
algorithm EnhancedDense is to inspect each individual component in the solution and attempt to modify 
it so that it satisfies, by itself, the skill set requirement imposed by the task T. This is done by examining 
the neighbors of the nodes in the component and adding those neighbors that are skilled nodes. The heuristics 
PartialTrimmedDense and CompleteTrimmedDense take as input the components generated by the 
algorithm EnhanceComponent (Algorithm 4) and attempt to reduce the size of each component by removing 
the non-skilled nodes one by one without making the component disconnected. The PartialTrimmedDense 
algorithm allows at most k non-skilled nodes in the component, whereas CompleteTrimmedDense attempts 
to remove as many non-skilled nodes as possible. The smallest resulting component with the required skilled 
nodes is then picked. This helps reduce the size of the solution, which is now a single component and 
hopefully still sufficiently dense, since the heuristic started with a 3-approximation to the density objective. 
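The component-enhancement step can be sketched for a single component and a single skill as follows. The function name `enhance_component` and the adjacency-dictionary representation are assumptions made for illustration. Unlike the pseudocode of EnhanceComponent, this sketch re-checks the skill requirement after the last neighbor has been examined, and it visits neighbors in sorted order only to make the result deterministic.

```python
def enhance_component(adj, component, skilled, k):
    """Add skilled one-hop neighbours to a component until it has k skilled nodes.

    adj:       dict mapping each node of the full graph to its set of neighbours.
    component: iterable of nodes forming one connected component of the solution.
    skilled:   set of nodes that possess the required skill a.
    Returns the enlarged node set, or None if the neighbours do not suffice.
    """
    grown = set(component)
    frontier = {v for u in grown for v in adj[u]} - grown
    for v in sorted(frontier):
        if len(grown & skilled) >= k:
            break  # requirement already met, stop adding nodes
        if v in skilled:
            grown.add(v)
    return grown if len(grown & skilled) >= k else None
```

Every added node is adjacent to the original component, so the enlarged component stays connected.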



Algorithm 6 PartialTrimmedDense(G, T) 

1: (Note: $T = \{\langle a, k \rangle\}$) 
2: $G' \leftarrow$ s-DensestAlk($G$, $T$) 
3: $C' \leftarrow$ EnhanceComponent($G'$, $T$) 
4: for each component $C'_i \in C'$ do 
5:     $Q \leftarrow \{u \mid u \in C'_i \text{ and } u \notin S(a)\}$ 
6:     while $Q$ not empty and $|V(C'_i) - S(a)| > k$ do 
7:         $u_{min} \leftarrow$ pop lowest degree node from $Q$ 
8:         if $(C'_i - u_{min})$ is connected then 
9:             $C'_i \leftarrow C'_i - u_{min}$ 
10:        end if 
11:    end while 
12:    if $|V(C'_i) - S(a)| > k$ then 
13:        $C' \leftarrow C' - C'_i$ 
14:    end if 
15: end for 
16: Return $\arg\max_{C'_i \in C'} density(C'_i)$ 



Algorithm 7 CompleteTrimmedDense(G, T) 

1: (Note: $T = \{\langle a, k \rangle\}$) 
2: $G' \leftarrow$ s-DensestAlk($G$, $T$) 
3: $C' \leftarrow$ EnhanceComponent($G'$, $T$) 
4: for each component $C'_i \in C'$ do 
5:     $Q \leftarrow V(C'_i) - S(a)$ 
6:     while $Q$ is not empty do 
7:         $u_{min} \leftarrow$ pop lowest degree node from $Q$ 
8:         if $(C'_i - u_{min})$ is connected then 
9:             $C'_i \leftarrow C'_i - u_{min}$ 
10:        end if 
11:    end while 
12: end for 
13: Return $\arg\min_{C'_i \in C'} |C'_i|$ 
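The trimming loop shared by PartialTrimmedDense and CompleteTrimmedDense can be sketched for one component as below. The helper names `is_connected` and `trim_component` and the `max_unskilled` parameter are hypothetical. The sketch assumes that "lowest degree" refers to the degree inside the current component, recomputed at each step, and it breaks ties by node name for determinism; the pseudocode does not specify either point.

```python
def is_connected(adj, nodes):
    """Depth-first search check that `nodes` induce a connected subgraph."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def trim_component(adj, component, skilled, max_unskilled=0):
    """Remove non-skilled nodes one by one while the component stays connected.

    max_unskilled=0 mimics CompleteTrimmedDense (remove as many as possible);
    max_unskilled=k mimics the stopping rule of PartialTrimmedDense.
    """
    kept = set(component)
    queue = kept - set(skilled)
    while queue and len(kept - set(skilled)) > max_unskilled:
        # Pop the candidate with the lowest degree inside the current component.
        u = min(queue, key=lambda v: (len(adj[v] & kept), str(v)))
        queue.remove(u)
        if is_connected(adj, kept - {u}):
            kept.remove(u)
    return kept
```

Each candidate is considered once, and a node whose removal would disconnect the component is kept, so skilled nodes are never lost and connectivity is preserved throughout.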



6.3 Single Skill Team Formation 

We run the single skill experiments for $k \in \{3, 5, 7, 9, 11, 13, 15\}$. For each value of $k$, we have a 
separate run for each skill $a \in \{\text{T}, \text{AI}, \text{DB}, \text{DM}\}$. We calculate statistics, such 
as density, size, and number of connected components, for each solution and present the mean over these four 
runs as the final statistic. 

Figures 3(a) and 3(b) show the ($k$ vs. density) and ($k$ vs. size) plots, respectively. From these plots, we 
can see that the density obtained by s-DensestAlk is significantly higher than the density obtained by the 
MinDiameter algorithm. This is of course expected. However, the downside is that the solution to s-DensestAlk 
is also larger (and in some cases disconnected). The heuristic EnhancedDense essentially adds neighbors to 
each component in the solution so that the resulting component satisfies the required skill set and then picks 
the one with the smallest size. Therefore connectivity is guaranteed. Further, the reduction in density is 
small and even the cardinality is reduced compared to the original solution. This also means that the 
solution returned by s-DensestAlk contained a good component to start with, where by a good component we mean 
a component that has most of the skills satisfied and has high density. 

Figure 3: Single skill experiments 

Now, notice that by applying the heuristics PartialTrimmedDense and CompleteTrimmedDense, we attempt 
to remove the non-skilled nodes one by one from each of these enhanced components (while maintaining 
connectivity). As the plots show, this serves the purpose of significantly reducing the cardinality of 
the solution, and as a hard constraint the algorithm still satisfies the skill requirement. It can be observed 
from the plots that PartialTrimmedDense has density almost equal to that of s-DensestAlk while the cardinality 
is reduced by more than fifty percent. Further, CompleteTrimmedDense gives a solution that has cardinality 
almost equal to $k$ (which would be optimal), with very little reduction in density. Finally, we plot ($k$ vs. 
density per node) in Figure 3(c). While this figure can be deduced from the other two, we present it to 
highlight the observation that the heuristics reduce the cardinality without compromising on the density. 
Notice that in this plot, CompleteTrimmedDense has the highest value of density per node for every value of $k$. 

Given that density is intuitively a better measure of team collaboration, these results show that we are 
completely able to eliminate connectivity issues inherent in this objective, and output small yet sufficient, 
and highly collaborative (dense) teams. 

6.4 Multiple Skill Team Formation 

We run the multiple skill experiments for $k \in \{3, 8, 13, 18, 23, 28\}$ and for each run, we randomly choose 
$k$ skills from $A = \{\text{T}, \text{AI}, \text{DB}, \text{DM}\}$. For example, when $k = 3$, we may choose a 
skill (multi)set {T, T, DM}, which means we want a subgraph that contains at least two authors of skill T and 
one author of skill DM. Recall that a given author can have multiple skills and therefore the solution may 
consist of a subgraph whose size is less than the value of $k$. 




Figure 4: Multiple skills experiments. (a) k vs. density; (b) k vs. size. 



Figures 4(a) and 4(b) plot ($k$ vs. density) and ($k$ vs. size), respectively, for the multiple skill team 
formation experiments. Note that the plots for the multiple skill experiments fluctuate more than those for the 
single skill experiments. This is due to the randomness in picking the multiple skill requirements. Also, some 
solutions returned are of the same size even as $k$ is increased. This is because sometimes the same solution 
satisfies different required skill sets. 

In these figures, we again see that the m-DensestAlk algorithm has the highest density. Note that the solution 
with density 0 and size 1 corresponds to an individual that has all the required skills. Further, similar 
to the single skill experiments, we apply the heuristics mentioned earlier in order to get a connected subgraph 
without compromising much on the density. Figure 4(b) shows that the heuristics have been effective in 
reducing cardinality. In fact, the cardinality of the solution obtained by CompleteTrimmedDense is less 
than $k$ because a single individual can satisfy more than one skill. Further, for the $k \geq 13$ tasks, the 
density achieved by the heuristics is also close to that of m-DensestAlk. While sometimes certain heuristics 
have low density (e.g., $k = 3$ or $k = 8$), all heuristics offer a nice trade-off between size and density (and 
return connected solutions by design). For each value of $k$, there exists at least one solution with density 
close to the maximum density and small cardinality. We omit the density per node plot here due to lack of 
space, and because it can be deduced from Figures 4(a) and 4(b). 

6.5 Density Vs. Diameter Analysis 

In the previous sections, we demonstrated the effectiveness of various heuristic algorithms in obtaining 
a solution subgraph that is connected, small and dense. The intuition behind suggesting density as a 
metric for team collaborative compatibility is that a denser graph has more edges between nodes, resulting 
in a greater possibility for collaboration. Small diameter does not necessarily guarantee this property. In 
this section, we consider three metrics for comparing the density and diameter based approaches: teamPubs, 
partialTeamPubs and teamPubRatio. The metric teamPubs is the number of publications where all 
the authors of the publication belong to the solution subgraph. The metric partialTeamPubs is the number of 
publications where at least half of the authors of the publication belong to the solution subgraph. These 
two metrics give a good indication of the collaboration compatibility of the reported teams. In addition, we 
propose another metric, teamPubRatio, which is essential for the comparative study because it is affected 
not only by the team members' collaboration compatibility but also by the size of the team. In this case, for 
each publication, say $p'$, we compute the ratio $\frac{|X' \cap A'|}{|X' \cup A'|}$, where $X'$ is the set of 
authors in the solution subgraph and $A'$ is the set of authors of the publication $p'$. That is, teamPubRatio 
measures the Jaccard similarity between a publication's author set and a team's author set. We then take the 
average of this quantity over all the publications. 
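As an illustration, the three metrics can be computed as in the following sketch. The function name `team_metrics` and the input format (each publication given as a set of author names) are assumptions made for the example. The sketch assumes a non-empty list of publications, each with at least one author.

```python
def team_metrics(team, publications):
    """Return (teamPubs, partialTeamPubs, teamPubRatio) for one team.

    team:         set of authors in the solution subgraph.
    publications: iterable of author sets, one per publication.
    """
    team = set(team)
    pubs = [set(p) for p in publications]
    # teamPubs: publications whose authors all belong to the team.
    team_pubs = sum(1 for p in pubs if p <= team)
    # partialTeamPubs: publications with at least half of their authors in the team.
    partial_team_pubs = sum(1 for p in pubs if 2 * len(p & team) >= len(p))
    # teamPubRatio: Jaccard similarity of author set and team, averaged over publications.
    team_pub_ratio = sum(len(p & team) / len(p | team) for p in pubs) / len(pubs)
    return team_pubs, partial_team_pubs, team_pub_ratio
```

Because the Jaccard denominator grows with the team, adding members who share no publications with the rest lowers teamPubRatio, which is why this metric does not automatically reward larger teams.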

We now describe the details of the evaluation strategy used to calculate these metrics. For both the single 
skill and multiple skill team formation problems, we consider the teams that were proposed 
as solutions in the previously described experiments. In particular, we consider the solutions reported by the 
algorithms CompleteTrimmedDense and MinDiameter. We choose only the CompleteTrimmedDense algorithm 
for density because it reports the smallest solutions. The goal is to establish that the small teams obtained 
by CompleteTrimmedDense also achieve superior results for the three metrics of collaboration compatibility 
mentioned above. The results of the metric evaluation are shown in Figures 5 and 6 for the single skill and 
multi skill experiments, respectively. In each plot, the value of $k$ is plotted along the x-axis and the value 
of the metric for the corresponding solution subgraphs is plotted along the y-axis. In the case of the single 
skill experiments, for each $k$, the metric value reported is the average of the metric values for the solutions 
corresponding to each of the skills {T, AI, DB, DM}. Further, for the metrics teamPubs and partialTeamPubs the 
y-axis shows the resulting number of publications, whereas for the metric teamPubRatio the y-axis shows the 
metric value scaled by a factor of 100000. From these plots it can be observed that in both the single skill 
and multi skill team formation problems, the algorithm CompleteTrimmedDense consistently outperforms the 
algorithm MinDiameter for all three metrics. In the single skill case, for each of the three metrics and for 
most values of $k$, the metric value for CompleteTrimmedDense is about twice that of MinDiameter. In the multi 
skill case, the variation is somewhat larger, but CompleteTrimmedDense consistently displays superior metric 
values in all cases. Recall that the sizes of the solution teams produced by these two algorithms were very 
similar (and the metric teamPubRatio does not necessarily benefit from a larger team size); therefore, these 
experiments suggest that density-based team formation leads to teams with better collaborative compatibility 
than diameter-based team formation. 

Figure 5: Single Skill Density vs. Diameter Analysis 




6.6 Qualitative evidence 

To analyze the quality of the teams that are returned by our algorithms for maximum density, we refer to the 
Most Cited Computer Science Authors list maintained by CiteSeerX (citeseerx.ist.psu.edu/stats/authors?all=true), 
which contains the 10000 most cited authors. We also refer to the list Central Authors: Computer Science 
(all-time) published at (confsearch.org/confsearch/ca.jsp) [11]. This list contains 1000 researchers ranked on 
the basis of DBLP publications. 

We examine the authors of the teams returned by the s-DensestAlk and m-DensestAlk algorithms in order to 
determine how many authors in each team are among the top 500 and top 1000 most cited authors according to 
the list maintained by CiteSeerX. Due to space constraints, we present only some representative lists from 
single skill team formation in Table 1. The lists are for $k = 3$ for T and DB, and for $k = 15$ for DM and AI. 
Team members who appear among the top 500 and top 1000 cited authors are indicated by bold and italic font, 
respectively. We can see from these results that each team contains many top cited and prolific or famous 
authors (who may not be in the top 1000 list). These results show that teams formed by choosing the objective 
of maximum density subgraph are intuitively meaningful. 

Complementary results are seen on using the second list, i.e., a list of top 1000 ranked researchers [11]. 
Instead of presenting another table with author names corresponding to this list, we adopt a different approach 

Table 1: Teams reported by s-DensestAlk. 

Skills    Authors 
T(3)      Prabhakar Raghavan, Ravi Kumar, Philip S. Yu, D. Sivakumar, Sridhar Rajagopalan, 
          Andrew Tomkins 
DB(3)     Philip S. Yu, Haixun Wang, Jiawei Han, Xifeng Yan, Wei Fan, Hong Cheng, 
          Charu C. Aggarwal 
DM(15)    Jiawei Han, Zheng Chen, Haixun Wang, Philip S. Yu, Amr El Abbadi, 
          Benyu Zhang, Wei Fan, Jun Yan, Shuicheng Yan, Hong Cheng, Qiang Yang, Ning Liu, 
          Jian Pei, Charu C. Aggarwal, Xifeng Yan, Divyakant Agrawal 
AI(15)    Ravi Kumar, Ronald Fagin, Philip S. Yu, Christos Faloutsos, Zheng Chen, 
          Wei-Ying Ma, Andrei Z. Broder, Jian-Tao Sun, Hongjun Lu, Dou Shen, Shuicheng Yan, 
          Anthony K. H. Tung, Wei Fan, Sridhar Rajagopalan, Qiang Yang, Eli Upfal, 
          Andrew Tomkins, Jure Leskovec 

Table 2: Team ranks based on top-ranked authors. 

Skills                        {s/m}-DensestAlk    CompleteTrimmedDense    MinDiameter 
T(3)                          23.42               8.11 
AI(3)                         20.81               17.34 
DB(3)                         18.25               18.25 
DM(3)                         18.25               18.25 
T(15)                         14.95               19.67                   2.05 
AI(15)                        15.25               14.48                   1.86 
DB(15)                        10.54               10.80                   0.75 
DM(15)                        9.55                9.93                    1.05 
T(1), DB(1), DM(1)            18.25               100                     24.39 
T(8), AI(6), DB(8), DM(6)     9.49                6.3                     4.1 

for measuring quality. We determine the overall rank of a team using the ranks of the individual authors 
within the team. To be specific, we compute the mean reciprocal rank of all the skilled individuals in the team 
and report the final rank of the team as $r = \frac{1000}{|S'|} \sum_{i \in S'} \frac{1}{r_i}$, where $r_i$ 
denotes the rank of a skilled individual $i$ and $S'$ denotes the skilled individuals in the team. Similar 
findings are observed if this quantity includes non-skilled nodes as well. We report the ranks observed in 
Table 2. Our original algorithms for maximum density and the subsequent heuristics form teams of highly ranked 
authors and perform significantly better than the minimum-diameter algorithm. The validation of these 
algorithms through two different qualitative approaches lends further credence to this framework of team 
formation using a density based objective. 
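The team rank described above can be sketched as a scaled mean reciprocal rank. This is a reading of the text rather than a confirmed formula: the function name `team_rank`, the `scale` parameter, and in particular the choice to let authors who are absent from the ranked list contribute zero are assumptions made for this example.

```python
def team_rank(skilled_members, rank_of, scale=1000):
    """Scaled mean reciprocal rank of the skilled members of a team.

    skilled_members: collection of skilled authors in the team.
    rank_of:         dict mapping an author to their rank (1 is the best).
    Authors missing from rank_of contribute 0 to the sum (an assumption).
    """
    members = list(skilled_members)
    if not members:
        return 0.0
    total = sum(1.0 / rank_of[a] for a in members if a in rank_of)
    return scale * total / len(members)
```

Under this reading, a team consisting of a single author ranked tenth scores 1000 / 10 = 100, and higher scores indicate teams of more highly ranked authors.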

7 Conclusions and Future Work 

We presented a novel approach for skilled collaborative team formation based on finding dense subgraphs. On 
the theoretical front, we showed constant factor approximation algorithms. On the practical side, we showed 
several heuristic improvements to our main provable algorithm, and compared it to the previous approach 
based on identifying small diameter subgraphs. Our experimental results show that the densest subgraph 
approach significantly outperforms the previous techniques on multiple different measures of collaborative 
compatibility. 

The formulations in this paper as well as in [12] assume that for any given skill, each node in the network 
is either skilled or not skilled. A nice generalization would be to consider a range of expertise for any skill, 
modeled as a value between 0 and 1. Another specific open question is to present more efficient algorithms for 
all objectives. Further, these definitions can be extended along many dimensions. In reality a team's value 
depends on several complex assets such as cultural backgrounds, geographical location, personalities, ability 
to work in teams, etc. Some of these characteristics cannot even be measured easily. While the current 
models are a good start, it would be nice to investigate these directions and move closer to the motivating 
realistic scenario. 